CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. § 119(e) to Provisional Patent Application
No. 60/063,679, which was filed on October 29, 1997. 

FIELD OF THE INVENTION 

The present invention pertains to methods for elucidating the function of
proteins and protein domains by examination of their three dimensional structure, and
more specifically, to the use of bioinformatics, molecular biology, and nuclear magnetic
resonance (NMR) tools to enable the rapid and automated determination of functions, as
a means of genome analysis. The present invention further pertains to an integrated
system for elucidating the function of proteins and protein domains by examining their
three dimensional structure.



BACKGROUND OF THE INVENTION 

One of the most powerful ways of identifying the biochemical and medical 
function of a gene product is to determine its three-dimensional structure. Although 
there are numerous examples in which the primary (i.e., linear) structure of a protein has
provided key clues to its biochemical function, three dimensional (3D) structure
determination is considered to be more definitive at establishing biochemical function.
The process of elucidating the 3D structure of large molecules, such as proteins, is
generally thought of as slow and expensive.

In the past, most drugs were discovered by screening proprietary chemicals with
animal models or receptor libraries. Today, this approach is being replaced by
"combinatorial chemistry" and "rational drug design". These are the primary methods
being used in the development of, for example, drugs targeted at the enzymes of the
human AIDS virus.

What limits the drug discovery process today is not screening or medicinal
chemistry but the rate at which the approximately 100,000 proteins in the human body can
be identified and prioritized as potential drug targets. Of particular significance for the
pharmaceutical industry are the emerging disciplines of bioinformatics and functional 
genomics. Application of technologies developed in these areas will allow companies 
to identify, in the next decade, the bulk of the most significant new drug targets. It has 
been estimated that about 10,000 genes from the human genome are of potential value 
in human medicine, but only a few percent of these genes have been isolated so far. 
However, it is reported that by the year 2005 the raw sequence data for all of these 
genes will have been determined by the Human Genome Project (HGP). 

I. PROTEIN STRUCTURE 

It is a generally accepted principle of biology that a protein's primary sequence 
is the main determinant of its tertiary structure. Anfinsen, Science 181:223-230 (1973);
Anfinsen and Scheraga, Adv. Prot. Chem. 29:205-300 (1975); and Baldwin, Ann. Rev. 
Biochem. 44:453-475 (1975). For over a decade, researchers have been studying the 
theoretical and practical aspects of the folding of recombinant proteins. 

For example, the "genetics" of protein folding using mutants of bovine 
pancreatic trypsin inhibitor (BPTI) has been studied. Mutants of BPTI were prepared in 
which several cysteine residues were replaced by alanine or threonine residues. These 
mutants were then expressed in a heterologous E. coli expression system. Although 
these mutants were found to fold into the proper conformation, the rate of the mutant 
folding was somewhat slower than that exhibited by wild-type BPTI. Marks et al.,
Science 235:1370-1373 (1987).

Ma et al. have also studied the genetics of protein folding using mutants of 
BPTI. Ma et al., Biochemistry 36:3728-3736 (1997). The model system described by
Ma et al. predicts that a "rearrangement" mechanism to form buried disulfides at a late 
stage in the folding reaction may be a common feature of redox folding pathways for 
surface disulfide-containing proteins of high stability. 

Nilsson et al. have reported that factors, such as peptidyl prolyl isomerase, 
protein disulfide isomerase, thioredoxin, and Sec B, may interact with the unfolded 
forms of specific classes of proteins, while members of the hsp70/DnaK and 
hsp60/GroEL molecular chaperone families may play a more general role in protein 
folding. Nilsson et al., Ann. Rev. Microbiol. 45:607-635 (1991). Nilsson et al. further
disclose that intrinsic folding rates, or even translation rates, of nascent proteins may be
optimized by natural selection. Secretion, proteolysis and aggregation are other in vivo
processes that depend greatly on the folding behavior of a given protein. Thus, protein
folding involves an interplay between the intrinsic biophysical properties of a protein, in
both its folded and unfolded states, and various accessory proteins that aid in the
folding process.

Proteins are generally composed of one or more autonomously-folding units
known as domains. Kim et al., Ann. Rev. Biochem. 59:631-660 (1990); Nilsson et al.,
Ann. Rev. Microbiol. 45:607-635 (1991). Multidomain proteins in higher organisms are
encoded by genes containing multiple exons. Combinatorial shuffling of exons during
evolution has produced novel proteins with different domain arrangements having
different associated functions. This is thought to have greatly increased the ability of
higher organisms to respond to environmental challenges because, via combinational
events, it has enabled genomes to readily add, subtract, or rearrange discrete
functionalities within a given protein. Patthy, Curr. Opin. Struct. Bio. 4:383-392 (1994);
and Long et al., Proc. Natl. Acad. Sci. (U.S.A.) 92:12495-12499 (1995).

II. INTERPRETATION OF A PROTEIN STRUCTURE 

Several methods have been used to elucidate the 3D structure of a given protein
molecule. Chiefly, these methods are X-ray crystallography and Nuclear Magnetic
Resonance (NMR).

A. X-ray crystallography 

X-ray crystallography is a technique that directly images molecules. A crystal
of the molecule to be visualized is exposed to a collimated beam of monochromatic
X-rays and the consequent diffraction pattern is recorded on a photographic film or by a
radiation counter. The intensities of the diffraction maxima are then used to construct
mathematically the three-dimensional image of the crystal structure. X-rays interact
almost exclusively with the electrons in the matter and not the nuclei.

The spacing of atoms in a crystal lattice can be determined by measuring the
angles and intensities at which a beam of X-rays of a given wavelength is diffracted by
the electron shells surrounding the atoms. Operationally, there are several steps in
X-ray structural analysis. The amount of information obtained depends on the degree of
structural order in the sample. Blundell et al. provide an advanced treatment of the
principles of protein X-ray crystallography. Blundell et al., Protein Crystallography,
Academic Press (1976), herein incorporated by reference. Likewise, Wyckoff et al.
provide a series of articles on the theory and practice of X-ray crystallography.
Wyckoff et al. (Eds.), Methods Enzymol. 114: 330-386 (1985), herein incorporated by
reference.



B. Nuclear Magnetic Resonance (NMR) 

The classical approach for the analysis of NMR resonance assignments was first
outlined by Wüthrich, Wagner and co-workers. Wüthrich, "NMR of proteins and
nucleic acids" Wiley, New York, New York (1986); Wüthrich, Science 243:45-50
(1989); Billeter et al., J. Mol. Biol. 155:321-346 (1982), all of which are herein
incorporated by reference. For a general review of protein structure determination in solution by
nuclear magnetic resonance spectroscopy, see Wüthrich, Science 243:45-50 (1989). See
also, Billeter et al., J. Mol. Biol. 155:321-346 (1982).

Wüthrich's classical approach can be briefly summarized in the following seven
steps:

Step 1: Identification of individual resonances associated with each spin
        system, and designation of key atom types (e.g., HN, Hα, N, Cα,
        Cβ, etc.).
Step 2: Classification of each identified spin system with respect to one
        or more possible amino acid residue type(s).
Step 3: Identification of possible sequential relations between spin
        systems using inter-residue NOESY or triple-resonance data.
Step 4: Unique mapping of strings of sequentially-connected spin
        systems to segments of the amino acid sequence, thus establishing
        "sequence specific assignments."
Step 5: Extension of assignments to resonances of peripheral side-chain
        nuclei in each spin system, and determination of stereospecific
        assignments.
Step 6: Generation of distance constraints using assigned resonance
        frequencies to interpret NOESY, scalar-coupling, and
        hydrogen/deuterium-exchange data in terms of "sequence-specific
        distance constraints."
Step 7: Structure generation using these constraints.

Automated implementations of these methods have made use of exhaustive
search, constraint satisfaction, heuristic best-fit or branch-and-bound limited search,
genetic algorithm, neural net, pseudoenergy minimization, and simulated annealing approaches.
Billeter et al., J. Magn. Resonance 76:400-415 (1988); Zimmerman et al., In:
Proceedings of the First International Conference of Intelligent Systems for Molecular
Biology. Washington: AAAS Press (1994); Zimmerman et al., J. Biomol. NMR 4:241-256
(1994); Zimmerman et al., Curr. Opin. Struct. Bio. 5:664-673 (1995); and
Zimmerman et al., J. Mol. Bio. 269:592-610 (1997).
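
The mapping described in Step 4 can be illustrated with a minimal sketch of the exhaustive-search variant. The sketch assumes that each spin system in a sequentially-connected string has already been reduced to a set of possible residue types (Step 2). The function name, the example sequence, and the candidate sets are hypothetical and are not taken from the cited software.

```python
from typing import List, Set


def map_string_to_sequence(sequence: str, candidates: List[Set[str]]) -> List[int]:
    """Return every 0-based start position at which a string of
    sequentially-connected spin systems is compatible with the sequence.

    candidates[i] holds the one-letter residue types still possible for
    the i-th spin system in the string (a hypothetical representation).
    """
    n = len(candidates)
    hits = []
    for start in range(len(sequence) - n + 1):
        # The string fits only if every spin system allows the residue
        # found at the corresponding sequence position.
        if all(sequence[start + i] in candidates[i] for i in range(n)):
            hits.append(start)
    return hits


# Illustrative sequence and a hypothetical string of four spin systems.
sequence = "MSGKMTGIVKWFNADKGFGFI"
string_of_spin_systems = [{"G", "S"}, {"I", "L", "V"}, {"V", "I"}, {"K", "R"}]
positions = map_string_to_sequence(sequence, string_of_spin_systems)

# A shorter, less specific string maps to more than one segment.
ambiguous = map_string_to_sequence(sequence, [{"G"}, {"F"}])
```

A string yields a sequence-specific assignment only when exactly one start position survives; the shorter string above fits two segments, which is why longer strings and tighter residue-type classifications are needed before the mapping is unique.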

Under traditional methodology, before a given protein is studied at the 3D level, 
the researcher has already obtained detailed experimental information regarding the
protein's function and characteristics. The 3D structure is typically the last of many 
experiments performed over many years of study. The 3D structure information is then 
used to refine the researcher's understanding of the given protein. Thus, under 
traditional methodology, it is very rare that the 3D structure of a given protein is 
determined before its biochemical function has been determined by other methods. 

The present invention represents a paradigm shift in methodology because the 
researcher would first determine the 3D structure of a protein of unknown function and 
then use this structure to gain clues as to its function, which would be subsequently 
validated by appropriate biochemical assays. 

SUMMARY OF THE INVENTION 

The present invention describes an integrated system for rapid determination of 
the three-dimensional structures of proteins and protein domains and application of this 
technology in a high-throughput analysis of human and other genomes for drug
discovery purposes.

The "structure-function analysis engine" described herein has the potential to 
discover the functions of novel genes identified in the human and other genomes faster 
than existing genetic or purely computational bioinformatics methods. 

The present invention employs: 

1. Bioinformatics methods, including the analysis of exon-exon phases and
other methods for segmenting or "parsing" DNA sequences of novel
genes into domain-encoding regions;

2. Robust and general "domain trapping" methods for producing correctly-
folded recombinant protein domains of novel biomedically-important
human disease gene products;

3. Robust and general methods for high level expression and isotopic
enrichment of these domains for NMR and X-ray crystallography
studies;

4. Screening methods to identify protein domain constructs that exhibit the 
properties required for structural analysis by NMR or X-ray 
crystallography; 



5. Computer software, NMR pulse sequences, and related NMR
technologies that provide fully automated analysis of protein structures
from NMR data;

6. NMR spectroscopy methods for determining 3D structures of these 
domains; 

7. Improved methods for mapping new domain structures to proteins in the 
Protein Data Bank that have similar structures and biochemical 
functions; 

8. A relational data base of the empirical properties of expressed domains
for organizing and integrating the biophysical and biological information
derived from these studies, as well as methods for making such relational
data bases; and

9. A method for integrating all of the above into a large-scale, high-
throughput macromolecular "structure-function analysis engine," and the
application of this "structure-function analysis engine" to the discovery of
biochemical functions of hundreds of genes from humans and human
pathogens.
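
As an illustration of item 8, the following is a minimal sketch of how a relational data base of expressed domain constructs might be organized. The table names, column names, gene name, and residue ranges are all hypothetical placeholders chosen for the example; the specification does not define a schema.

```python
import sqlite3


def create_domain_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical two-table schema: genes and their domain constructs."""
    conn.executescript(
        """
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        CREATE TABLE domain_construct (
            construct_id  INTEGER PRIMARY KEY,
            gene_id       INTEGER NOT NULL REFERENCES gene(gene_id),
            start_residue INTEGER NOT NULL,
            end_residue   INTEGER NOT NULL,
            expressed     INTEGER NOT NULL,  -- 1 if expression succeeded
            folded        INTEGER            -- 1/0 from biophysical screens, NULL if untested
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_domain_db(conn)

# Placeholder records; the gene name and residue ranges are invented.
conn.execute("INSERT INTO gene (gene_id, name) VALUES (1, 'GENE_X')")
conn.executemany(
    "INSERT INTO domain_construct "
    "(gene_id, start_residue, end_residue, expressed, folded) VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 150, 1, 1), (1, 1, 60, 1, 0)],
)

# List the constructs that passed the folding screens.
folded = conn.execute(
    "SELECT g.name, d.start_residue, d.end_residue "
    "FROM domain_construct d JOIN gene g USING (gene_id) "
    "WHERE d.folded = 1"
).fetchall()
```

Keeping constructs in their own table lets several candidate boundaries for the same gene be recorded side by side, together with the outcome of each expression and folding test.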

The specific biomedical gene targets that this technology can be used to develop
include:

1. Domains from the human Alzheimer's β peptide precursor protein
(APP).

2. Domains from other proteins genetically implicated in neoplastic,
metabolic, neurodegenerative, cardiovascular, psychiatric and
inflammatory disorders.

3. Domains from proteins associated with infectious agents (e.g., bacteria,
fungi and viruses).

The present invention provides a high-throughput method for determining a 
biochemical function of a protein or polypeptide domain of unknown function 
comprising: (A) identifying a putative polypeptide domain that properly folds into a 
stable polypeptide domain, the stable polypeptide having a defined three dimensional 
structure; (B) determining three dimensional structure of the stable polypeptide domain; 
(C) comparing the determined three dimensional structure of the stable polypeptide 
domain to known three-dimensional structures in a protein data bank, wherein the 
comparison identifies known structures within the protein data bank that are 
homologous to the determined three dimensional structure; and (D) correlating a 
biochemical function corresponding to the identified homologous structure to a
biochemical function for the stable polypeptide domain.
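
Step (C) can be illustrated with a minimal sketch that ranks known structures by similarity to a determined structure. The specification does not name a comparison algorithm; the sketch assumes a distance-matrix RMSD (dRMSD) over equal-length sets of atomic coordinates, chosen here because it needs no superposition. The function names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]


def distance_rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root-mean-square difference between the intramolecular distance
    matrices of two equal-length coordinate sets."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("coordinate sets must have equal length >= 2")
    total = 0.0
    count = 0
    for i, j in combinations(range(len(a)), 2):
        diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
        total += diff * diff
        count += 1
    return math.sqrt(total / count)


def rank_matches(query: Sequence[Coord],
                 library: Dict[str, Sequence[Coord]]) -> List[Tuple[str, float]]:
    """Score every same-length library entry against the query, best first."""
    scores = [(name, distance_rmsd(query, coords))
              for name, coords in library.items()
              if len(coords) == len(query)]
    return sorted(scores, key=lambda item: item[1])


# Toy data: a unit square, a translated copy, and a square with doubled sides.
query = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
library = {
    "stretched": [(0, 0, 0), (2, 0, 0), (2, 2, 0), (0, 2, 0)],
    "shifted_copy": [(5, -2, 3), (6, -2, 3), (6, -1, 3), (5, -1, 3)],
}
ranked = rank_matches(query, library)
```

Because dRMSD depends only on internal distances, it is unchanged by translation and rotation, but it also cannot distinguish a structure from its mirror image, and it requires an equal number of matched atoms. A practical fold comparison would need an alignment step that this sketch omits.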

The present invention further provides an integrated system for rapid
determination of a biochemical function of a protein or protein domain of unknown
function comprising: (A) a first computer algorithm capable of parsing the target polynucleotide
into at least one putative domain encoding region; (B) a designated lab for expressing
the putative domain; (C) an NMR spectrometer for determining individual spin
resonances of amino acids of the putative domain; (D) a data collection device capable
of collecting NMR spectral data, wherein the data collection device is operatively
coupled to the NMR spectrometer; (E) at least one computer; (F) a second computer
algorithm capable of assigning individual spin resonances to individual amino acids of a
polypeptide; (G) a third computer algorithm capable of determining the three dimensional
structure of a polypeptide, wherein the polypeptide has had resonances assigned to individual amino
acids of the polypeptide; (H) a database, wherein stored within the database is
information about the structure and function of known proteins and determined proteins;
and (I) a fourth computer algorithm capable of determining 3D structure homology
between the determined three-dimensional structure of a polypeptide of unknown
function to three-dimensional structure of a protein of known function, wherein the
protein of known structure is stored within the protein database.

The present invention further provides a high-throughput method for
determining a biochemical function of a polypeptide of unknown function encoded by a
target polynucleotide comprising the steps: (A) identifying at least one putative
polypeptide domain encoding region of the target polynucleotide ("parsing"); (B)
expressing the putative polypeptide domain; (C) determining whether the expressed
putative polypeptide domain forms a stable polypeptide domain having a defined three
dimensional structure ("trapping"); (D) determining the three dimensional structure of
the stable polypeptide domain; (E) comparing the determined three dimensional
structure of the stable polypeptide domain to known three dimensional structures in a
Protein Data Bank to determine whether any such known structures are homologous to
the determined structure; and (F) correlating a biochemical function corresponding to
the homologous structure to a biochemical function for the stable polypeptide domain.

BRIEF DESCRIPTION OF THE FIGURES 

Figure 1 provides a flow chart of the high-throughput structure/function analysis 
system of the present invention. 




Figure 2A provides the far UV circular dichroism spectra of the purified 
recombinant APP NTD2-3 domain. Figure 2B provides the near UV circular dichroism 
spectra of the purified recombinant APP NTD2-3 domain. 

Figure 3 provides an NMR spectrum of the purified recombinant APP NTD2-3.

Figure 4 provides a hydrogen-deuterium exchange time course for the purified
recombinant APP NTD2-3.

Figure 5 provides the results of a cooperative thermal unfolding experiment of
the purified recombinant APP NTD2-3.

Figure 6 provides the results of the NMR 15N-1H heteronuclear single quantum
coherence (HSQC) spectral analysis of the NTD2-3 domain collected on a Varian Unity
500 spectrometer.

Figure 7 provides the 2D 15N-1H HSQC spectrum of CspA at pH 6.0 and 30 °C.

Figure 8A provides an illustration of information derived from triple resonance
data sets used for establishing intraresidue and sequential correlations of spin systems.

Figure 8B provides an illustration of NMR data used to identify structural
elements in CspA. Slowly exchanging backbone amides (t1/2 > 3 min at pH 6.0 and
30 °C) are indicated by filled circles (t1/2 < 30 min) or stars (t1/2 > 30 min). Values of
3J(HN-Hα) coupling constants are indicated by vertical bars; filled bars indicate that the
data provided a useful estimate (± 0.5 Hz) of the corresponding coupling constant, while
open bars indicate that the experimental data provide only an upper bound on its value.
Values of conformation-dependent secondary shifts ΔδCα and ΔδCβ are plotted with
solid bars. The locations of the five β-strands are indicated with arrows.

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

One of the best clues to a protein's function is its structure. The present
invention describes a structure-based bioinformatics platform to be used in "functional
genomics" analyses of the torrent of DNA sequence data emerging from the
international HGP. This technology will allow for the isolation of novel
biopharmaceuticals and/or drug targets from gene sequence information with an
efficiency that is far beyond present day capabilities. By developing extremely fast yet
rigorous technologies for macromolecular structure determination, it is possible to
convert the stream of one-dimensional DNA sequence information emerging from
human genome research efforts into 3D protein structures. This 3D structural
information can then be used to map these human gene products to protein families with
similar biochemical functions.



The present invention describes a "drug discovery search engine" that allows
human genetic and genomic data to be smoothly interfaced with proven rational drug
design and combinatorial chemistry approaches. The technology described herein
enables determination of the structures for virtually the entire complement of human
protein domains, encoded in the approximately 100,000 human genes.

III. STRUCTURE SUGGESTS FUNCTION

It is a tenet of modern structural biology that structure suggests function: a
given protein "fold" tends to be used over and over again in nature for a restricted set of
functions. Knowledge of the structure of a new protein often reveals kinship
to a family of other proteins with already known functions and thus provides
clues regarding the biochemical function of the protein at hand. Holm et al., Science
(1996); Bork et al., Curr. Opin. Struct. Bio. 4:393-403 (1994); Brenner et
al., Proc. Natl. Acad. Sci. (U.S.A.) 95:6073-6078 (1998), all of which are herein
incorporated by reference. This kinship relationship is a natural manifestation of the
fact that families of protein molecules have evolved from a common ancestral module,
and that, in the course of this evolution, the 3D structure is largely preserved while new,
though chemically related, biochemical functions are adopted. This is precisely the
reasoning behind the assigning of "expressed sequence tag" (EST) sequences to known
protein families using one-dimensional sequence comparisons.

Evolution generally acts to conserve 3D structures rather than the amino acid
sequences of proteins. For this reason, proteins have often evolved over time so that
their sequences exhibit no obvious similarity while their structures remain highly
homologous. In practical terms, this means that simple sequence comparisons overlook
many - and perhaps even most - instances of protein-protein relatedness. However,
this relatedness, with all of its functional implications, can easily be identified by 3D
structure comparisons.

The multidomain nature of many mammalian proteins makes them more
difficult to express in recombinant form and also impedes their structure determination
by X-ray crystallography or NMR. The expression and structure determination of an
isolated domain is, in contrast, less problematical. Since an isolated domain comprises
one or more discrete functional units in a protein, knowing structure-function
information about a given individual domain in a multicomponent protein generally
provides key information that can be used to proceed with drug development on the
protein. The "domain trapping" methods of the present invention generate many
novel gene products suitable for structural analysis by NMR spectroscopy and X-ray
crystallography.

Recent developments in the areas of high-level protein expression technology,
X-ray crystallography, heteronuclear NMR spectroscopy, and artificial intelligence (AI)-
based structural analysis software have dramatically improved the speed and lowered
the cost of protein structure determination. Estimates of the total number of human
genes in the genome (approximately 10^5) contrast dramatically with
the number of protein folds in nature (approximately 10^3), and it has been estimated
that one-third to one-half of these folds have already been described. Chothia et al.,
Nature 357:543-544 (1992). Simple statistics imply that many new gene products will
exhibit structures that map to existing fold classes associated with proteins of known
biochemical function. Thus, the harvest of functional information about new human
genes from this approach will be immediate.

IV. DESIGN OF A HIGH-THROUGHPUT SYSTEM FOR
ANALYZING PROTEIN STRUCTURES AND
FUNCTIONS

Figure 1 provides a flow chart of the high-throughput structure/function analysis
used in the present invention for analyzing human and pathogen gene products. The
flow chart outlines the general methods of the present invention. Each sub-step of the
present invention is outlined in detail below. It is to be understood that the hardware
disclosed herein can be or is operatively linked to one or more computers.

A. Approaches for identifying novel protein domains 

The present invention provides a method for predicting the location of domains
and domain boundaries within a given DNA sequence. Under one embodiment, this is
accomplished through a knowledge based application which segments or parses
genomic or cDNA sequences of genes into domain encoding sequences. Under another
embodiment, the knowledge based application of the present invention can also segment
or "parse" mRNA sequences into domain encoding sequences. Preferably, the
knowledge based application of the present invention is encoded within a computer
algorithm software application. Preferably, this expert system applies rules developed
on a set of experimentally-verified DNA sequence/protein domain comparisons that
have been compiled from public sequence and protein structure databases. Thus, for a
novel gene sequence, this expert system generates the predicted domains and/or domain
boundaries which are then used to create domain-specific expression constructs.

Under one of the preferred embodiments, the gene sequence is parsed by the
exon phase rule. Exon termini (5'- or 3') that begin or end within protein coding regions
are classified by phase: an exon terminus that starts or stops between two
codons is called a "phase 0" terminus; an exon terminus that starts or stops after the first
nucleotide in the codon is called a "phase 1" terminus; and an exon terminus that starts
or stops after the second nucleotide in the codon is called a "phase 2" terminus. For
example, where ("*") marks the position of an exon-exon junction:

Phase 0:
5' ... -ATG*GGA-CTC- ... 3'
... - Met - Gly - Leu - ...

Phase 1:
5' ... -ATG-G*GA-CTC- ... 3'
... - Met - Gly - Leu - ...

Phase 2:
5' ... -ATG-GG*A-CTC- ... 3'
... - Met - Gly - Leu - ...
The genetic coding sequences for protein domains, which have been reported to
have been "shuffled" between various genes during evolution, should be bounded by
exon termini of the same phase (or by the N- or C-terminal ends of the holoprotein),
because termini of mismatched phase would cause a frameshift
mutation in the downstream sequences upon splicing (Patthy,
FEBS Letters 214:1-7 (1987); Patthy, Curr. Opin. Struct. Bio. 4:383-392 (1994),
all of which are herein incorporated by reference). Therefore, the domain encoding
regions should be bounded on both sides by phase 0 exon termini, by phase 1 exon
termini or by phase 2 exon termini, but not by termini of different phases.
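
The exon phase rule can be illustrated with a minimal sketch. It assumes that the coding length of each exon (in nucleotides, counted from the start codon) is known, so that the phase of a junction is the number of coding nucleotides upstream of it, modulo 3. The function names and the exon lengths in the example are hypothetical.

```python
from typing import List, Optional, Tuple


def junction_phases(exon_coding_lengths: List[int]) -> List[int]:
    """Phase (0, 1 or 2) of each exon-exon junction, in 5'-to-3' order."""
    phases = []
    upstream = 0
    for length in exon_coding_lengths[:-1]:
        upstream += length
        phases.append(upstream % 3)
    return phases


def candidate_regions(exon_coding_lengths: List[int]) -> List[Tuple[int, int]]:
    """Runs of consecutive exons (first, last; 0-based, inclusive) that are
    bounded on both sides by termini of the same phase. The N- and
    C-terminal ends of the protein are accepted as boundaries of any phase."""
    n = len(exon_coding_lengths)
    # bounds[k] is the boundary before exon k; None marks a protein end.
    bounds: List[Optional[int]] = [None] + junction_phases(exon_coding_lengths) + [None]
    regions = []
    for first in range(n):
        for last in range(first, n):
            left, right = bounds[first], bounds[last + 1]
            if left is None or right is None or left == right:
                regions.append((first, last))
    return regions


# Hypothetical gene with five coding exons (lengths in nucleotides).
lengths = [90, 121, 152, 60, 99]
phases = junction_phases(lengths)
regions = candidate_regions(lengths)
```

In this example the junctions have phases 0, 1, 0 and 0, so exons 1-2 together (bounded by two phase 0 termini) form a candidate domain-encoding region, whereas exons 2-3 (bounded by a phase 1 and a phase 0 terminus) do not.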

As a result of the mechanism of molecular evolution, structural and functional
domains are mixed and matched between protein sequences. Domains
can be identified by looking for segments of gene sequences that are conserved across
many genes from different organisms. Known domain families generally involve 50 -
300 amino-acid long segments that are observed in many proteins.
Bioinformatics algorithms capable of identifying these conserved segments, or gene
fragment clusters, in the data base of gene sequences have been reported. These
algorithms can be used to identify candidate domain-encoding regions in novel gene
sequences. Gouzey et al., Trends Biochem. Sci. 27:493 (1994), herein incorporated by
reference.

Under a second preferred embodiment, domains from gene sequence data are
identified through predictions of their interdomain boundaries. There is ample evidence
from molecular evolution and cell biology studies that information regarding domain
boundaries is embedded in the sequences of protein coding genes. Some reports have
claimed that rare codon clusters, which cause ribosomal pausing during translation, are
correlated with domain boundaries. Purvis et al., J. Mol. Biol. 193:413-417 (1987);
Nilsson et al., Ann. Rev. Microbiol. 45:607-635 (1991); Thanaraj et al., Protein Sci.
(1996); Thanaraj et al., Protein Sci. 5:1594-1612 (1996); and Guisez et al.,
J. Theor. Biol. 162:243-252 (1993), all of which are herein incorporated by reference.
Messenger RNA secondary structures have also been reported to play such a
"punctuation" role during translation.

One embodiment of the present invention employs an algorithm that identifies
such sequence features and compares these data with the actual domain sequences in the
relational database of the present invention. The relational database of the present
invention contains domain sequence information of known and determined protein
domains. It is understood that the relational database of the present invention will
expand over time such that each polypeptide domain determined using the methods of
the present invention will be added to the relational database. Under this embodiment,
it is possible to rigorously assess the reliability of these bioinformatics methods of
domain prediction and, iteratively, modify the software to improve its reliability.
Neural nets and genetic algorithms both can be used for deriving rules for domain
prediction. Reliable prediction is valuable because it greatly reduces
the number of expression constructs that would have to be tested in
order to correctly parse a novel gene sequence into its component domain sequences.
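
One such sequence feature, a cluster of rare codons, can be located with a simple sliding-window scan, sketched below. The window length, the threshold, and the set of codons treated as rare are placeholder assumptions made for illustration; a real analysis would derive the rare set from a codon usage table for the organism in question.

```python
from typing import List, Set, Tuple

# Placeholder rare-codon set (an assumption, not a validated usage table).
RARE_CODONS: Set[str] = {"AGG", "AGA", "CGA", "CTA", "ATA"}


def rare_codon_clusters(cds: str,
                        rare_codons: Set[str] = RARE_CODONS,
                        window: int = 9,
                        min_rare: int = 4) -> List[Tuple[int, int]]:
    """Return (first_codon, last_codon) index ranges, 0-based and inclusive,
    where at least min_rare of window consecutive codons are rare."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    flags = [codon in rare_codons for codon in codons]

    clusters: List[Tuple[int, int]] = []
    for start in range(len(codons) - window + 1):
        if sum(flags[start:start + window]) < min_rare:
            continue
        end = start + window - 1
        # Merge windows that overlap or touch the previous cluster.
        if clusters and start <= clusters[-1][1] + 1:
            clusters[-1] = (clusters[-1][0], end)
        else:
            clusters.append((start, end))
    return clusters


# Hypothetical coding sequence: common codons flanking a rare-codon run.
cds = "GCT" * 10 + "AGGAGAGCTAGGCGA" + "GCT" * 10
clusters = rare_codon_clusters(cds, window=5, min_rare=4)
no_clusters = rare_codon_clusters("GCT" * 20, window=5, min_rare=4)
```

The codon ranges returned by such a scan could then be compared with the domain boundaries stored in the relational database to assess how well the feature predicts boundaries.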

Under another embodiment, the solution structure of a protein or protein domain
can be analyzed by a method that combines enzymatic proteolysis and matrix assisted
laser desorption ionization mass spectrometry (Cohen et al., Protein Sci. 4:1088-1099
(1995); Seielstad et al., Biochem. 34:12605-12615 (1995), both of which are
incorporated by reference in their entirety). This method is capable of inferring
structural information from determinations of protection against enzymatic proteolysis
as governed by solvent accessibility and protein flexibility. Preferably, the proteolytic
enzymes employed by this method include trypsin, chymotrypsin, thermolysin, and
Asp-N endoprotease.

B. "Domain Trapping": Expression and biophysical
characterization of putative recombinant protein domains

With respect to genes of unknown function, the investigator, generally, does not
have available an enzyme assay or other obvious activity-based means to assess the
biochemical activity of a novel recombinant protein domain. The present invention
addresses this difficulty in a three-pronged manner. First, the present invention uses a
reliable and high yield expression system for protein expression. An example is a
secretion-based protein A fusion system that is one of the most tested and reliable
methods known for producing correctly-folded recombinant proteins in the E. coli
periplasm. Nilsson et al., Methods Enzymol. 185:144-161 (1990), herein incorporated by
reference. Alternatively, the pET plasmid expression system may be used. Studier
et al., J. Mol. Bio. 189:113-130 (1986), herein incorporated by reference. Second, the
present invention uses a set of activity-independent biophysical criteria to assess
whether the protein domain has properly folded. This set of criteria has been developed
through extensive study of recombinantly-expressed protein folding mutants. Finally,
based on the supposition that autonomous folding of the protein domain can be
prevented due to too much or too little polypeptide sequence information, respectively
(Kim et al., Ann. Rev. Biochem. 59:631-660 (1990); Nilsson et al., Ann. Rev. Microbiol.
45:607-635 (1991), both of which are herein incorporated by reference), the present
invention uses systematic strategies for identifying and trapping domains that enable,
through a combination of molecular biological and biophysical methods, experimental
parsing of any gene into its component domains. In other words, a polypeptide domain has a
"defined three dimensional structure" when that polypeptide domain exhibits the
activity-independent biophysical criteria of a properly folded domain.

Under one preferred embodiment, an activity-independent biophysical criterion
used to assess the correctness of folding of a protein includes circular dichroism
measurements. More preferably, characterization of an isolated domain of a protein is
analyzed by circular dichroism measurements in the far UV. An ellipticity minimum in
the far UV is indicative of α-helical secondary structure. Preferably, CD measurements are
also made in the near UV
(see Creighton, Proteins: Structure and molecular properties, 2nd Ed., W. H. Freeman
& Co., New York, New York (1993), and related texts, herein incorporated by
reference). A signal in the aromatic region around 280 nm is consistent with the
presence of Trp, Tyr, and Phe chromophores in an ordered environment, such as would
be expected in the hydrophobic core of a folded protein. In general, assays for the
folding of purified expressed proteins that employ solely biophysical criteria have been
reported.
It is preferable to further characterize the isolated domain by 1H-NMR
spectroscopy. Preferably, the isolated domain is in a moderately concentrated solution
(100 µM). A high dispersion pattern of the proton resonance spectrum is reported to
be characteristic of a well-folded polypeptide.

A time-course of amide hydrogen-deuterium exchange measurements can also
be performed on the isolated domain. From this, it is possible to observe whether the
backbone NH groups are significantly protected within the domain. Significant
protection is an indication that the hydrogen-bonded secondary structure is stabilized by
tertiary interactions, which is consistent with a well-folded domain structure.

F naUy mermaldenaruration experiments, monitored by intrinsic tryptophan 
fluoJe^Lalsobeperformed.TheseexperimenUarea.socapab.eofde.ermtmng 

whether the isolated domain is a compact domain structure. 

In principle, this is a general strategy. Thus, it can be used to parse many genes
in the human genome that encode proteins of unknown biochemical function into their

study. This general strategy can be easily modified to provide a high-throughput

the present invention. For a typical 10 - 30 kD protein domain, 500 or 600 MHz one

of protein. Using a continuous flow NMR probe with a

microcomputer-controlled chromatography pump and simple

screen 50 - 100 candidate domains per day for folded

domain structure can then be further validated using the other biophysical
criteria described above. An NMR spectrometer suitable for use in the present

invention is a Varian Unity 500 spectrometer. 

C. High level expression and isotopic enrichment 

Uniform biosynthetic enrichment with 13C and 15N isotopes has been
reported to be a prerequisite for the analysis of macromolecular structures by NMR
spectroscopy. Some NMR strategies have also been reported to benefit from random
enrichment with 2H isotopes. The principal obstacle for isotope-enriched protein
production in most recombinant production systems is the high cost of the
media components (e.g. 13C-glucose @ $330/g), and the limited possibilities for scale-
up to multi-liter fermenters. The less well-controlled conditions of shake
flask cultivations often result in lower protein production levels. The production of

13C- and/or 2H-enriched proteins thus requires an efficient system capable of providing
high level production of the desired protein in small-scale bioreactors.

Under one preferred embodiment, the present invention employs a bacterial
production system for 15N, 13C-enriched recombinant proteins. Preferably, the bacterial
production system is based on intracellular production of recombinant proteins in E. coli
as fusion to an IgG-binding domain analogue, Z, derived from staphylococcal protein
A (Nilsson et al., Protein Eng. 1:107-113 (1987); Altman et al., Protein Eng. 4:593-600

(1991), both of which are herein incorporated by reference). In this system,
transcription is initiated from the efficient promoter of the E. coli trp operon. This
allows for efficient intracellular production of fusion proteins. These fusion proteins
can then be purified by IgG affinity chromatography. Using this approach it is possible
to achieve high-level (40 - 200 mg/L) production in defined minimal media of a number
of isotope-enriched proteins (see, for example, Jansson et al., J. Biomol. NMR 7:131-

Under another preferred embodiment, the recombinant isotope-enriched domain
protein may be produced using pET plasmid expression vectors (Studier and Moffatt, J. Mol.
Biol. 189:113-130 (1986), herein incorporated by reference) under the control of the T7
RNA polymerase promoter (see, for example, Newkirk et al., Proc. Nat'l Acad. Sci.
USA 91:5114-5118 (1994); Chaterjee et al., J. Biochem. 114:663-669 (1993); and
Shimotakahara et al., Biochemistry 36:6915-6929 (1997), all of which are herein

incorporated by reference).

Under another preferred embodiment, isotope-enriched recombinant proteins
can be produced by acclimating a bacterial production system to grow in 95% 2H2O.
Recombinant bacterial production hosts [e.g., the BL21 (DE3) strain] can be acclimated
to grow in 95% 2H2O by successive passages in media containing increasing amounts of
2H2O. Protein production levels of acclimated bacteria grown in 95% 2H2O are identical
to those obtained in H2O. Using protiated [uniformly 13C-enriched]-glucose as the
carbon source, 2H-enrichment levels of 70 - 80% can be achieved; high incorporation of
2H from the 2H2O solvent results from metabolic shuffling during amino acid
biosynthesis. While the resulting proteins are not 100% perdeuterated, they are
sufficiently enriched for the purpose of slowing 13C transverse relaxation rates and

Perdeuterated samples can also be produced using 2H2O solvent and [uniformly 2H,
13C-enriched]-glucose as the carbon source.

Under one preferred embodiment, such isotope-enriched proteins are
renatured by the method of Kim et al., which employs

immobilized on a solid support. Km-*.** £ "^"<^;X*e 
incorporated by reference. The isotope-enriched proteins can also be renatured by the
method of Maeda et al., which employs programmed reverse denaturant gradient

Chaperones such as GroEL/ES, dnaK, dnaJ, etc., may be used to assist in protein folding.
Nilsson et al., Ann. Rev. Microbiol. 45:607-635 (1991), herein incorporated by
reference.

Preferably, the fusion vectors are constructed to interface with downstream

even under harshly denaturing conditions, such as high concentrations of
guanidine hydrochloride and dithiothreitol. For such purposes, the preferred class of
vectors employs protein-RNA fusions. Such fusion proteins can be purified using
oligonucleotide affinity columns with high specificity in the presence of chaotropic

25 23*lll-112(199Q;Muiisfaier«f.,^ 

al., Bio/Technology 12:1119-1124 (1994); Cregg et al., Bio/Technology 11:905-910
(1993), all of which are herein incorporated by reference.

Once the protein domain of interest has been expressed at high levels, it is
necessary to purify large quantities of the protein domain for its
characterization. Preferably, at least 5-10 mg of the protein domain is
purified. More preferably, at least 50 mg of the protein domain is purified.

Methods for preparing large quantities of a given protein of sufficient purity for

not all methods for protein purification are applicable to a given protein of interest, it is
generally understood that the following methods represent preferred embodiments:
affinity chromatography, ammonium sulfate precipitation, dialysis, FPLC
chromatography, ion exchange chromatography, ultracentrifugation, etc. For a general

older 1/. (eI), ft** ****** PP- 71 - 82 ' (198?>; ^ f^' , 
£L ML 1043* C (1984); Scopes, ft**. Portion: Proles W 
Practice (2nd ed.), Springer-Verlag (1987), and related texts, all of which are herein
incorporated by reference.

D. Rapid screening of NMR and crystallization properties

One common problem for both NMR analysis and crystallization is poor
solubility and/or slow precipitation of the protein sample. These properties are highly
dependent on the pH, ionic strength, and reducing agent concentration
of the buffer solvent. Thus, it is preferable to optimize these conditions to maximize
solubility for NMR analysis and to optimize the conditions for protein crystallization.

Under one of the preferred embodiments of the present invention, the
optimization experiments are conducted with an array of microdialysis buttons to
scan a plurality of standardized buffer conditions to identify those most suitable
for NMR studies and/or crystallization of each domain construct (Bagby et al., J. Biomol. NMR
10:279-282 (1997), incorporated by reference in its entirety). Preferably, each
microdialysis button contains at least 1 µL of a ~1 mM protein solution. More
preferably, each microdialysis button contains at least 5 µL of a ~1 mM protein
solution. The microdialysis buttons of the present invention are commercially available.
Preferably, each microdialysis button is dialyzed against about 50 ml of dialysis buffer,
such as in a 50 ml conical tube (Falcon). Preferably, the dialysis is performed at room temperature.
However, the dialysis can be performed at temperatures ranging from 4 - 40 °C. Because
NMR studies are routinely performed at room temperature for extended lengths of time,
it is preferable that the protein remain in solution under these conditions.
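
By way of non-limiting illustration, the small amount of material consumed by each button can be estimated from the stated volume and the ~1 mM concentration. The sketch below is only an arithmetic aid; the function name is hypothetical, and the 20 kD molecular weight is an assumed example value within the 10 - 30 kD range discussed herein.

```python
def protein_per_button(volume_ul, conc_mm, mw_kd):
    """Return (nanomoles, micrograms) of protein held in one microdialysis button."""
    # 1 uL of a 1 mM solution contains 1 nmol of solute.
    nmol = volume_ul * conc_mm
    # 1 nmol of a 1 kD protein weighs 1 ug.
    micrograms = nmol * mw_kd
    return nmol, micrograms


# Assumed example: a 5 uL button of ~1 mM solution of a 20 kD domain.
nmol, micrograms = protein_per_button(5.0, 1.0, 20.0)
print(f"{nmol:.1f} nmol, {micrograms:.0f} ug per button")
```

Under these assumptions, a screen of 100 buffers at 5 µL per button would consume on the order of 10 mg of a 20 kD domain.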

Preferably, the protein samples are initially prepared in buffers containing 50%
glycerol (which is not suitable for NMR studies but generally provides good solubility)
and then dialyzed against different buffers containing little or no glycerol. With respect
to NMR and X-ray crystallography studies, it is understood that a person of skill in the
art would know what buffers could be used to prepare the protein for study. The skilled

artisan typically has a set of 50-100 standard buffers which are used to prepare protein
for subsequent studies. These buffers can then be modified if necessary to
optimize the protein preparation. The ability of a given protein to remain soluble at
high concentration or form suitable crystals is dependent on the pH of the solution, as
well as the concentration of different salts, buffers, reagents, and temperature. Thus, the

in a few days. Preferably, multiple samples are analyzed in parallel.

the protein has remained in solution or whether the protein has aggregated.

Using the "button test" of the present invention, a single technician could screen
solubility properties in 100 different buffers for ~20 domains per week. Under
another preferred embodiment, these screens can be carried out using state of the art

soluble, dynamic light scattering can be used to examine the disperse properties
in different buffer conditions. Ferré-D'Amaré and Burley, Structure

2:357-359 (1994), herein incorporated by reference. Alternately, Trp or Tyr
fluorescence anisotropy can be used to measure rotational diffusion which is another

of NMR properties, and all of the protein samples which pass this stage of the process
will likely meet basic spectroscopic quality criteria. Standard criteria used to

of skill in the art, include a good dispersion pattern and a narrow peak width, etc.
Preferably, gel filtration chromatography and dynamic light scattering data are

collected in the course of domain purification. Such data provide information

about the oligomerization state of the enriched samples.

For domains of the appropriate size (< ~30 kD), isotopically enriched samples

are scored in terms of their suitability for structure determination by NMR using
standard 2D HSQC, 2D NOESY, and/or 2D CBCANH triple-resonance spectra. The

provide good data in the full set of experiments required for automated structure
determination. For each 15N, 13C enriched domain, this evaluation typically requires at
least 5 - 10 mg of sample, and approximately 6 hours of NMR data collection.
Preferably the evaluation is performed on about 10 mg of sample. Thus, ~20 domains
can be evaluated per "spectrometer-week" using the methods of the present invention.
A "spectrometer-week" as used herein means one skilled technician, working on one

— Messrs; sssi- .« 

by X-ray crystallography. 
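
By way of non-limiting illustration, the figure of ~20 domains per spectrometer-week can be related to the stated ~6 hours of NMR data collection per domain. In the sketch below the function name is hypothetical, and the 120 usable instrument hours per week is an assumption chosen because it reproduces the stated figure; the present description gives only the 6 hour and ~20 domain values.

```python
def domains_per_spectrometer_week(hours_per_domain=6.0, usable_hours_per_week=120.0):
    """Whole number of domains that can be evaluated in one spectrometer-week."""
    return int(usable_hours_per_week // hours_per_domain)


# Assumed 120 usable hours (of the 168 hours in a week) at 6 hours per domain.
print(domains_per_spectrometer_week())
# Upper bound if the instrument collected data around the clock.
print(domains_per_spectrometer_week(6.0, 168.0))
```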
The present invention employs advanced NMR data collection and automated
analysis technologies. These data collection

Montelione et al., Biochemistry 31:236-249 (1992); Lyons et al., Biochemistry
32:7839-7845 (1993); Rios et al., J. Biomol. NMR 8:345-350 (1996); Tashiro et al.,
J. Mol. Biol. 272:573-590 (1997); Shimotakahara et al., Biochemistry
36:6915-6929 (1997); Laity et al., Biochemistry 36:12683-12699 (1997);
Feng et al., Biochemistry 37:10881-10896 (1998); and Swapna et al., J. Biomol. NMR
9:105 (1997), all of which are herein incorporated by reference. These data

strategy for defining NMR resonance assignments in proteins. Zimmerman et al.,
Curr. Opin. Struct. Bio. 5:664-673 (1995); and Zimmerman et al., J. Mol. Biol.

^nlonUym.t.ple^^^^^ 
1. AUTOASSIGN: Artificial intelligence methods
for automated analysis of protein resonance
assignments

Resonance assignments form the basis for analysis of protein structure and
dynamics by NMR (Wüthrich, K., NMR of Proteins and Nucleic Acids, John Wiley &
Sons, New York, New York (1986), herein incorporated by reference), and their
determination represents a primary bottleneck in protein solution structure analysis.
However, the introduction of multi-dimensional triple-resonance NMR has dramatically
improved the speed and reliability of the protein assignment process. Montelione et al.,
J. Magn. Res. 87:183-188 (1990); Ikura et al., Pharmacol.

(1990); Tashiro et al., J. Mol. Biol. 272:573-590 (1997); Shimotakahara et al., Biochem.
36:6915-6929 (1997); Laity et al., Biochem. 36:12683-12699 (1997); Feng et al.,
Biochem. 37:10881-10896 (1998), all of which are herein incorporated by reference.

Preferably, the present invention employs AUTOASSIGN, an expert system that
determines protein 15N, 13C, and 1H resonance assignments from a set of three-
dimensional NMR spectra. Zimmerman et al., Proceedings of the First International

et al., J. Biomol. NMR 4:241-256 (1994); Zimmerman et al., Curr. Opin. Struct. Bio.
5:664-673 (1995); Zimmerman et al., J. Mol. Biol. 269:592-610 (1997), all of which are
herein incorporated by reference. AUTOASSIGN has been copyrighted by Rutgers, the
State University of New Jersey. Alternatively, the present invention can employ one of
the following expert systems for the automated determination of protein 15N, 13C and 1H
resonance assignments from a set of three-dimensional NMR spectra. These include a
modified version of FELIX which is available from Molecular Simulation (San Diego,
CA) (Friedrichs et al., J. Biomol. NMR 4:703-726 (1994), incorporated by reference in
its entirety), CONTRAST which is available from the world wide web at
<<www.bmrb.wisc.edu/macroo/soft_contrast.html>> (Olson and Markley, J. Biomol.
NMR 4:385-410 (1994), incorporated by reference in its entirety), and a series of small
programs described by Meadows, J. Biomol. NMR 4:79-86 (1994), incorporated by

reference in its entirety.

AUTOASSIGN is implemented in the Allegro Common Lisp Object System
(CLOS) and requires a lisp compiler (available from Franz, Inc.) for execution. The
software utilizes many of the analytical processes employed by NMR spectroscopists,
including constraint-based reasoning and domain-specific knowledge-based methods.
Fox et al., Canadian Proceedings in Artificial Intelligence (1986); Nadel et al.,
Technical Report, DCS-TR-170, Computer Science Department, Rutgers Univ. (1986);
Kumar et al., Artificial Intelligence Mag., Spring, 32-44 (1992), all of which are
incorporated by reference in their entirety.

Input to AUTOASSIGN includes a peak-picked 2D HSQC spectrum and
the following seven peak-picked 3D spectra: HNCO, CANH, CA(CO)NH, CBCANH,
CBCA(CO)NH, H(CA)NH, and H(CA)(CO)NH. This family of triple-resonance
experiments can be used by AUTOASSIGN to automatically determine
extensive resonance assignments for proteins
ranging in size from 8 kD to 1

Zimmerman et al., J. Mol. Biol. 269:592-610 (1997); Tashiro et al., J. Mol. Biol.
272:573-590 (1997); Shimotakahara et al., Biochem.
36:6915-6929 (1997); Laity et al., Biochem. 36:12683-12699 (1997); Feng et al.,
Biochem. 37:10881-10896 (1998). The program handles some of the very challenging
problems encountered in automated analysis, including missing spin systems, spin
systems that overlap even in the 3D spectra, and extra spin systems due to multiple
conformations of the folded protein structure (e.g. X-Pro peptide bond cis/trans
isomerization). Execution times on a Sun Sparc 10 workstation range from 16 to 36
seconds, depending on the complexity of the problem analyzed by the program. Preferably,
the NMR spectrometer of the present invention is equipped with three channels and a
fourth frequency synthesizer for carbonyl decoupling. Under another preferred

embodiment of the present invention, the AUTOASSIGN program provides for automated
analysis of resonance assignments for atoms of the polypeptide backbone. More preferably,
the AUTOASSIGN program of the present invention provides for fully automated

Having established assignments for the backbone

sidechain resonance assignments. It is additionally preferred that 3D 15N-edited

NOESY and 3D 13C-edited NOESY data are collected and automatically analyzed to
confirm the resonance assignments.

Under one of the preferred embodiments of the present invention,
AUTOASSIGN is designed to implement strategies that allow complete resonance

For example, sensitivity-enhanced
versions of HCCNH-TOCSY and HCC(CO)NH-TOCSY
complete set of information required for the determination of resonance assignments

assignments from the current 7-10 days to about half of this time. Zimmerman et al.,
J. Biomol. NMR 4:241-256 (1994); Lyons et al., Biochemistry 32:7839-7845 (1993);

NMR 3:487-493 (1993), demonstrated that significant

strategy described herein, will utilize 2H, 13C, 15N-enriched

protein exchanges rapidly with the solvent H2O used in the course of the protein
preparation to yield the protiated 15N-1H amide groups. This strategy can provide
analysis of resonance assignments for the carbon and nitrogen

assignments for the attached hydrogen atoms can be completed using
HCCH-NOESY, and HCCH-TOCSY experiments. Correction factors for 2H-isotope
for each carbon site of the 20 amino acids can be determined using data

have already been determined for these model proteins.
high temperature superconducting

probes. First generation versions of these probes are currently being marketed by
Varian NMR Inst. Inc. and Bruker Inst.
sidechain H, C and N assignments to less than one week per domain.

2. Software for automated analysis of protein
structures from NMR data

Having completed the resonance assignments for a particular protein, the next
step of the structure determination process of the present invention involves analyzing

Spera and Bax, J. Am. Chem. Soc. 113:5490-5492 (1991); Wishart et al., J. Biomol. NMR
6:135-140 (1995), both of which are herein incorporated by reference. This
information is combined with other bioinformatics data derived from the protein

Overhauser effect (NOE) data arising from magnetic

Interpretation of these data (determined as described above) in an

employs software for automated
the generation of input files for rapid structure

The problems encountered in

largely to spectral overlaps, i.e., it is often the
very similar resonance frequencies. One of the preferred approaches

problem is to use 3D (or 4D) 15N- or 13C-resolved NOESY experiments (Clore et al.,

A»i Jtev. Biop^. 2039 SUM >. (1994 ),allof 

involved in the NOE intera ^ 

i xnn d-iQ 96 CI 994) herein incorporated by reference. 
"^T«X: TchLnsea — — - 

candidate assignments of the remaining unassigned NOESY cross peaks.
The AUTO_STRUCTURE program of the present invention is a C++ program.
AUTO_STRUCTURE is a C++ program that analyzes 2D and 3D NOESY spectra to

these crosspeak assignments. AUTO_STRUCTURE can also use a low-resolution

that are not uniquely assigned, removing potential NOE assignments that are severely
inconsistent with the low-resolution structure. AUTO_STRUCTURE propagates the
structural constraints imposed by the uniquely assigned NOEs to determine
of otherwise ambiguous NOEs. AUTO_STRUCTURE can successfully analyze
NOESY spectra and, in an iterative fashion, automatically
structures of

be
used in the present invention include GARANT (Wüthrich

incorporated by reference in its entirety), ARIA (Michael Nilges, J. Mol. Biol. 245:645-660
(1995), incorporated by reference in its entirety) and NOAH (Mumenthaler and
Braun, J. Mol. Biol. 254:465-480 (1995), incorporated by reference in its entirety).
Preferably, the auto structure program of the present invention provides for

embodiment, the auto structure program of the present invention

application of pattern recognition algorithms.
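
By way of non-limiting illustration, the step of removing candidate NOE assignments that are severely inconsistent with a low-resolution structure can be sketched as a simple distance filter. This is not the AUTO_STRUCTURE program itself. The function names, the atom labels, and the 6.0 Å cutoff are all assumptions introduced for this sketch, and the coordinates are made up.

```python
import math


def filter_noe_candidates(candidates, coords, max_dist=6.0):
    """Keep only candidate atom pairs whose distance in the current model
    does not exceed max_dist (an assumed cutoff, in Angstroms)."""
    kept = {}
    for peak, pairs in candidates.items():
        kept[peak] = [
            (a, b) for a, b in pairs
            if math.dist(coords[a], coords[b]) <= max_dist
        ]
    return kept


def uniquely_assigned(kept):
    """Cross peaks left with exactly one candidate are treated as assigned."""
    return {peak: pairs[0] for peak, pairs in kept.items() if len(pairs) == 1}


# Hypothetical low-resolution model and ambiguous cross peaks.
coords = {"HA_10": (0.0, 0.0, 0.0), "HB_42": (3.0, 0.0, 0.0), "HN_77": (20.0, 0.0, 0.0)}
candidates = {
    "peak1": [("HA_10", "HB_42"), ("HA_10", "HN_77")],
    "peak2": [("HB_42", "HN_77")],
}
kept = filter_noe_candidates(candidates, coords)
print(uniquely_assigned(kept))
```

In an iterative scheme of the kind described above, the newly unique assignments would be added to the constraint set, the structure recalculated, and the filter applied again.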

biochemical functions 

Preferably, the resulting domain structures derived from NMR or X-ray
crystallographic analyses are compared with the PDB or other suitable databases of
structure homology matching.

Protein Data Base (PDB), which can be found at
Algorithms for 3D-structure homology matching
invention include the DALI analysis program (Holm et al., J. Mol. Biol. 233:123-138
(1993), herein incorporated by reference), the CATH analysis program (Orengo, C. A.

Structure 5:1093-1108 (1997), herein incorporated by reference), VAST

of which are incorporated by reference in their entirety) or similar algorithms for 3D

3D structure and provides a list of PDB entries with high match scores. Based on
current hit rates by newly-determined structures against already known folds (Holm
et al., Methods Enzymol. 266:653-662 (1996); Holm et al., Science 273:595-603 (1996),
both of which are herein incorporated by reference), it is expected that greater than 50% of
the structures will show significant structural and functional homology to proteins of

known structure and function.

In order to facilitate and enhance the ability to identify common biochemical
functions for these DALI hits, it is preferable to develop a structure-function knowledge
base (Figure 1), correlating each protein structure in the PDB with the set of
biochemical functions that have been associated with that protein in the published
scientific literature. Where information is available, this knowledge base will also
correlate the portions of the known protein structures with corresponding specific
biochemical functions (e.g., enzymatic active sites or
fold-function knowledge base is applicable to a wide range of structural bioinformatics
applications, and of significant utility to the nascent industry of structural

Once novel protein domains with clear homologies to better-characterized
counterparts have been identified, the proposed functions can be validated using

assays. For example, if a protein looks like a member of the galactosyl
transferase family, the protein will be tested for radioactive UDP-galactose or other
substrate binding; if it looks like a lipase, the protein will be tested for lipid binding
and/or hydrolysis activity, and so on.
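
By way of non-limiting illustration, a structure-function knowledge base of the kind described above can be thought of as a mapping from each structure entry to its literature-associated functions, which is then queried with the list of homology hits. The sketch below is an assumption-laden toy: the function name, the 50% threshold, and the entry identifiers ("hitA" and so on, which are placeholders rather than real PDB codes) are all hypothetical.

```python
from collections import Counter


def common_functions(hit_ids, knowledge_base, min_fraction=0.5):
    """Return functions shared by at least min_fraction of the structural hits."""
    counts = Counter()
    for entry in hit_ids:
        # Count each function at most once per entry.
        counts.update(set(knowledge_base.get(entry, ())))
    total = len(hit_ids)
    return sorted(f for f, c in counts.items() if total and c / total >= min_fraction)


# Placeholder knowledge base: entry identifier -> associated functions.
knowledge_base = {
    "hitA": ["lipase", "lipid binding"],
    "hitB": ["lipase"],
    "hitC": ["galactosyl transferase"],
}
print(common_functions(["hitA", "hitB", "hitC"], knowledge_base))
```

A function proposed in this way would still be confirmed experimentally, for example by the binding or hydrolysis assays mentioned above.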

G. Integration into a large-scale, high-throughput
"engine" for structural and functional analysis of
hundreds of human genes

Under one preferred embodiment, the present invention provides for a "structure-
function analysis engine" capable of high-throughput discovery of biochemical
functions of new human disease genes and genes of unknown function.

Using conventional methodology, the skilled artisan may be able to determine
the 3D structure of one protein per year. However, using the methodology of the

protein per year. Under optimal conditions, the present invention will enable a properly
equipped laboratory to generate the 3D structure of one protein per month per NMR

structure of protein domains of unknown function at a rate which is faster than the rate
at which a skilled artisan could determine a protein structure using traditional
One of the central features of the present invention is that it is highly scalable.
Under one of the preferred embodiments, the high-throughput "engine" consists of a
dedicated laboratory staffed with artisans skilled in relevant arts (e.g., NMR and X-ray
crystallography, molecular biology, biochemistry, etc.). Preferably, such a laboratory is
further equipped with state of the art equipment for the sequencing, sub-cloning

ZJ^Z^ — -* -* sis of 46 p r in h domai r ; f^ 

rate limiting component of this high-throughput "engine" is the number of NMR
machines within the laboratory. Thus, the rate at which protein domains can be
characterized will increase with the addition of additional NMR machines. In contrast to
conventional methodology, the present invention provides a method for determining the
3D structure of unknown protein domains whose rate is not solely dependent on the
number of artisans skilled in 3D protein structure determination.

The rate of domain characterization increases as each of the tasks which are

presently conducted by hand are automated. For example, under one of the preferred

embodiments, the parsing of the unknown gene into its component

facilitated through the use of advanced sequence analysis algorithms.

the preferred embodiments, the rate of domain characterization is increased through the

use of improved computer software for the automated analysis of NMR datapoints.

Although the present invention is drawn to using NMR to determine protein
structure and function, it is to be understood that a person of skill in the art could
perform a similar analysis using X-ray crystallography to practice the present invention.
Shapiro and Lima, Structure 6:265-267 (1998); Gaasterland, Nature Biotech. 16:625-
627 (1998); Terwilliger et al., Prot. Sci. 7:1851-1856 (1998); Kim, Nature Structural
Biology (Synchrotron Supp.): 643-645 (1998), all of which are incorporated by
reference in their entirety.
V. SPECIFIC GENE TARGETS

Preferably, the specific gene targets that will be analyzed using the present

the biochemical function and three-dimensional structures of the proteins encoded by
domains will be analyzed using the high-

targeted to these human gene products.

Although the present invention is principally drawn to human genomic, cDNA

and mRNA sequences,

genomic, cDNA or mRNA sequences of any living organism or virus.

While the present invention is capable of determining the function of any
given protein or protein domain, the preferred biomedical gene targets of the present
invention include Alzheimer's β peptide precursor protein (APP). Additional preferred

discovery efforts.

Having now generally described the invention, the same will be more readily
illustration and are not intended to be limiting on the present invention.

EXAMPLE 1

PARSING OF THE APP GENE INTO
DOMAIN-ENCODING REGIONS

A. Parsing by the exon phase rule

Gene 87:257-263 (1990)) was subjected to a parsing analysis with respect to the phases
of its exon-exon boundaries:

Exon-exon boundary     Phase
0 



1 - 2 

2 - 3 

3 - 4 

4 - 5 

5 - 6 

6 - 7 

7 - 8 

8 - 9 

9 -10 

10- 11 

11- 12 

12- 13 

13- 14 

14- 15 

15- 16 

16- 17 

17- 18 



0 
1 
0 



0 
0 
0 
0 



0 



Using the exon phase rule, only exons or exon combinations that start or stop in
the same phase are allowed. For example, exon 7, exons 10+11, and exons
10+11+12 would be potential domain encoding regions with
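
By way of non-limiting illustration, the exon phase rule can be applied mechanically: a contiguous run of exons is retained as a candidate domain-encoding region only when the boundary preceding its first exon and the boundary following its last exon have the same phase. In the sketch below the function name is hypothetical and the phase values are made-up placeholders, not the phases of the APP gene.

```python
def candidate_regions(boundary_phase, first_exon, last_exon):
    """List (start_exon, end_exon) runs whose flanking boundaries share a phase.

    boundary_phase maps an exon-exon boundary (i, i + 1) to its phase (0, 1 or 2).
    """
    regions = []
    for start in range(first_exon, last_exon + 1):
        left = boundary_phase.get((start - 1, start))
        if left is None:
            continue
        for end in range(start, last_exon + 1):
            right = boundary_phase.get((end, end + 1))
            if right == left:
                regions.append((start, end))
    return regions


# Placeholder phases for a hypothetical six-exon gene.
phases = {(1, 2): 1, (2, 3): 0, (3, 4): 1, (4, 5): 2, (5, 6): 1}
print(candidate_regions(phases, 2, 5))
```

Enumerating every run in this way grows quadratically with the number of exons, which is one motivation for pruning strategies such as those described in the following sections.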

B.

The APP gene is reported to be alternatively spliced. The longest polypeptide

encoded by the APP gene is 7

missing the amino acids encoded by exons 7, 8, and/or 15

Acad. Sci. 777:281-287 (1996), herein incorporated by reference).

which are alternatively spliced are bounded
Alternative splicing
must be done in such a way as to not disrupt
without destroying essential folding information). The

phase 1 exon boundaries, that is, phase

candidate boundaries of domain encoding regions.



C. Setting the phase with known internal domain structures

Exon 7 of APP is known to encode a complete domain for a Kunitz-type serine
protease inhibitor (Hynes et al., Biochemistry 29:10018-10022 (1990)). The Kunitz
inhibitor is a domain that has been combinatorially shuffled around in various genes
during evolution (Patthy, L., Curr. Opin. Struct. Biol. 1:351-361 (1991)), and for the
reasons given above it would have to be inserted only into proteins with other domains
of the same phase in order to not disrupt gene expression. Therefore this analysis is
also consistent with APP being composed of domains which are bounded by phase 1
exon termini.



D. The "N-terminus first" strategy of parsing

In order to reduce the combinatorial complexity of the parsing problems, an "N-
terminus first" strategy is preferred. In this parsing strategy, expression constructs of
candidate domains are made starting from the N-terminus of the protein and extending to
the likely C-termini as predicted by the above rules. These constructs are put through
the "domain trapping" test of the present invention in order to identify the first N-
terminal domain. Then, once the first N-terminal domain is identified, a second set of
constructs commencing from the C-terminus of the first N-terminal domain is made,
and so on.

In the case of APP, the N-terminus of the protein starts with exon 2 because
exon 1 encodes a signal peptide. Therefore, the possible domain constructs that ended
at phase 1 boundaries were exons 2-3 and exons 2-6 (exon 7 was known to encode the
Kunitz inhibitor domain). By the domain trapping criteria exons 2-3 were found to
encode the first N-terminal domain, so a second construct composed of exons 4-6 was
made and found to contain the second domain of APP, and so on. A summary of the
APP domains identified by this combination of parsing and domain trapping is given
below:

Domain                      Encoding Exons

1 (N-terminal domain)       2-3
2                           4-6
3 (Kunitz inhibitor)        7
4                           8
etc.
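
By way of non-limiting illustration, the "N-terminus first" strategy can be sketched as a loop that, starting from the first coding exon, tests constructs ending at each allowed boundary and moves the start point past each domain that is trapped. The function names are hypothetical, the domain trapping test is represented by a stand-in callable, and accepting the shortest construct that passes is an assumption of this sketch.

```python
def n_terminus_first(first_exon, allowed_ends, is_folded):
    """Return the (start_exon, end_exon) domains found by N-terminus-first parsing.

    allowed_ends: exons after which a permitted (e.g. phase 1) boundary occurs.
    is_folded:    stand-in for the experimental domain trapping test.
    """
    domains = []
    start = first_exon
    ends = sorted(allowed_ends)
    while True:
        # Try constructs from the current start to each later allowed end.
        hit = next((e for e in ends if e >= start and is_folded(start, e)), None)
        if hit is None:
            break
        domains.append((start, hit))
        start = hit + 1
    return domains


# Stand-in results mirroring the APP summary above (exons 2-3, 4-6 and 7).
trapped = {(2, 3), (4, 6), (7, 7)}
found = n_terminus_first(2, [3, 6, 7], lambda s, e: (s, e) in trapped)
print(found)
```

Because each accepted domain fixes the start of the next search, the number of constructs that must be expressed grows roughly linearly with the number of allowed boundaries rather than with all start and end combinations.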
EXAMPLE 2

EXPRESSION AND PURIFICATION OF AN ISOLATED DOMAIN

secretion-based protein A fusion expression system and
Methods Enzymol. 185:144-161 (1990), herein incorporated by reference.

EXAMPLE 3

EXPRESSION AND PURIFICATION OF AN ISOLATED DOMAIN
FOR NMR ANALYSIS

Protein Expression

E. coli strain RV308 is used as the bacterial expression host. Competent RV308
cells are transformed with plasmid containing the NTD 2-3, Z domain insert.
Cells are grown overnight at 37 °C on LB agar plates with 100 µg/ml
ampicillin (Sigma). Fresh transformants are used to

µg/ml ampicillin. Cultures are grown overnight at 30 °C in 250

expression culture (2.5 g/l NH4 sulfate,
potassium phosphate buffer, pH 6.6, supplemented with 5 g/l glucose (>98%
purity), 1 g/l magnesium sulfate,

solution, 1 ml of 1000 x vitamin solution and
culture is spun down by centrifugation. Bacterial pellets are resuspended in
fresh MJ media, and used to inoculate expression cultures. Cultures are grown in
1 l baffled flasks and induced at OD 0.9-1.0 with indole acrylic acid to a final
concentration of 20 mg/l. Cultures are harvested 15 hours after induction by
centrifugation. Bacterial pellets are stored at -20 °C until purification.

Cells are suspended in 100 ml of 25 mM Tris, pH 8, 5 mM EDTA,

fold by dialysis against twenty volumes of 10 mM HCl.
IgG affinity purification is used to purify the NTD 2-3, Z domain fusion from
any contaminating proteins. The 10 mM HCl protein solution is

1 M Tris, pH 8.0. The sample is then applied to an IgG sepharose column
(Pharmacia) pre-equilibrated with TST buffer. The column is washed with 10 bed

volumes of 5 mM ammonium acetate, pH 5.0. Finally, the protein is eluted with 0.5
M acetic acid, pH 3.4. In preparation for refolding, the protein eluate is neutralized
with Tris, and an equal volume of 7 M guanidine is added to bring the final
guanidine concentration to 3.5 M.
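
By way of non-limiting illustration, the final guanidine concentration follows from simple dilution of the 7 M stock into the eluate. The sketch below assumes that the eluate contains no guanidine beforehand and that volumes are additive; the function name is hypothetical.

```python
def final_concentration(stock_molar, stock_volume, sample_volume):
    """Concentration after mixing a stock with a sample that contains none of the solute."""
    return stock_molar * stock_volume / (stock_volume + sample_volume)


# One volume of 7 M guanidine added to one volume of neutralized eluate.
print(final_concentration(7.0, 1.0, 1.0))
```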
Refolding of the protein is carried out by using dialysis to slowly dilute out the
guanidine HCl while slowly introducing the refolding buffer. Firstly, Spectra/Por
dialysis tubing with a MWCO of 6000-8000 is soaked overnight in water in order to
remove glycerol. Next, the protein solution is loaded into the prepared tubing and
dialyzed against fresh refolding buffer. The dialysis reaction is incubated for two days
with magnetic stirring. Refolded protein is then concentrated using an IgG
sepharose column pre-equilibrated with TST buffer. Bound protein is eluted with 0.5 M
acetic acid and collected in fractions in order to keep the volume as low as possible.
Refolded fusion protein is then further purified by gel filtration on a Pharmacia
Superdex 75 FPLC column using 300 mM ammonium bicarbonate, 0.1 mM copper
sulfate as the buffer. Fractions corresponding to the fusion protein are pooled, and the
protein is quantitated using the optical density at 280 nm.
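
By way of non-limiting illustration, quantitation from the optical density at 280 nm is commonly done with the Beer-Lambert relation, A = ε · c · l. The sketch below is a general arithmetic aid rather than the procedure actually used in this example; the function name is hypothetical and the extinction coefficient and molecular weight shown are made-up example values.

```python
def concentration_mg_per_ml(a280, ext_coeff_per_m_cm, mw_da, path_cm=1.0):
    """Protein concentration from absorbance at 280 nm (Beer-Lambert)."""
    molar = a280 / (ext_coeff_per_m_cm * path_cm)  # mol/L
    return molar * mw_da                           # g/L, equal to mg/ml


# Assumed values: A280 = 0.5, extinction coefficient 20000 /M/cm, 20000 Da protein.
print(round(concentration_mg_per_ml(0.5, 20000.0, 20000.0), 3))
```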

Cleavage of the fusion protein is carried out using Genenase I (NEB), a variant
of subtilisin BPN'. Fusion protein is buffer exchanged into Genenase buffer (Tris
pH 8.0, 200 mM NaCl, 0.02% NaN3) using an Amicon stir cell. Protein
concentration is adjusted to 2 mg/ml and Genenase is added to a concentration of 0.2
mg/ml. The reaction is incubated at room temperature for 4 days and the extent of
cleavage is followed using SDS-PAGE. Cleaved NTD 2-3 is separated from
uncleaved fusion and Z domain by passing the solution over an IgG column, and
collecting the unbound NTD 2-3 in the flow through. The NTD is then purified from
Genenase by gel filtration on a Pharmacia Superdex 75 FPLC column using 300 mM
ammonium bicarbonate, 0.1 mM copper sulfate as the buffer.
EXAMPLE 4

DOMAIN TRAPPING: CHARACTERIZATION
OF AN ISOLATED DOMAIN

Characterization of an isolated domain (NTD2-3) from the Alzheimer's amyloid
precursor protein (APP) by circular dichroism measurements in the far UV shows an
ellipticity minimum at 222 nm, indicative of α-helical secondary structure (Figure 2A).
Of even greater significance, CD measurements at longer wavelengths reveal a clear
signal in the aromatic region around 280 nm, consistent with the presence of Trp, Tyr
and Phe chromophores in an ordered environment such as would be expected in the
hydrophobic core of a folded protein (Figure 2B). A concentrated solution
(~100 µM) of the isolated N-terminal domain is further characterized by
one-dimensional 1H-NMR. The isolated recombinant APP N-terminal domain exhibits high
[...], which is a [...] of a well-ordered fold.

A time course of amide hydrogen-deuterium exchange measurements is
performed. From this, it is observed that many backbone NH groups exhibit significant
protection, indicating hydrogen-bonded secondary structure stabilized by tertiary
interactions consistent with a well-folded domain structure (Figure 4). Finally, thermal
denaturation experiments, monitored by intrinsic tryptophan fluorescence, are
performed. These experiments show that the recombinant APP NTD2-3 domain
undergoes a cooperative thermal unfolding transition, with a Tm of approximately 60°C,
indicative of a compact domain structure (Figure 5).
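
A cooperative thermal transition of this kind is commonly described by a two-state model, in which the midpoint temperature Tm is the point where half of the molecules are unfolded. The sketch below is illustrative only: it assumes a two-state transition with no heat-capacity change, and the unfolding enthalpy of 300 kJ/mol is a hypothetical value, not a measurement from these experiments.

```python
import math

R = 8.314  # gas constant, J/(mol K)


def fraction_unfolded(temp_c, tm_c, dh_m):
    """Two-state fraction unfolded at temp_c, ignoring any heat-capacity change."""
    t = temp_c + 273.15
    tm = tm_c + 273.15
    dg_unfold = dh_m * (1.0 - t / tm)  # J/mol, zero at the midpoint
    return 1.0 / (1.0 + math.exp(dg_unfold / (R * t)))


def estimate_tm(temps_c, fractions):
    """Linearly interpolate the temperature at which the fraction crosses 0.5."""
    for k in range(len(temps_c) - 1):
        f0, f1 = fractions[k], fractions[k + 1]
        if f0 < 0.5 <= f1:
            t0, t1 = temps_c[k], temps_c[k + 1]
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    return None  # no transition midpoint inside the sampled range


# Simulated melt sampled every 5 degrees with an assumed Tm of 60 C.
temps = list(range(20, 101, 5))
curve = [fraction_unfolded(t, 60.0, 300e3) for t in temps]
print(estimate_tm(temps, curve))
```

With measured fluorescence data, the signal would first be normalized between the folded and unfolded baselines before applying `estimate_tm`.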

Thus, using biophysical data alone, it is demonstrated that the NTD2-3 domain
of APP, encoded by exons 2 and 3, is expressed as a well ordered tertiary structure.
Chiang et al., Neurobiology of Aging, Supplement Vol. 17, No. 4S, abstract (1996).
Similar studies indicate that the next APP N-terminal domain is encoded by exons 4-6,
the third (Kunitz) domain by exon 7, and so on.

EXAMPLE 5

NMR CHARACTERIZATION OF THE NTD 2-3 DOMAIN 

For NMR studies NTD 2-3 is concentrated to concentrations greater than 10
mg/ml. Gel filtration pure NTD 2-3 is first buffer exchanged into an NMR compatible
buffer, 20 mM potassium phosphate, pH 6.5, using an Amicon stir cell. The protein
solution is then concentrated to an appropriate volume based on the amount of protein
present using the Amicon 50 and Amicon 3 stir cells. The final protein concentration is
[...].

15N-HSQC spectral analysis is shown in Figure 6. The good dispersion in both the 15N
and 1H dimensions demonstrates that this is a folded domain that has been trapped by
the presently described methods.

EXAMPLE 5

COMPARISON OF THE NMR STRUCTURE OF CSPA WITH OTHER PROTEINS 

Recombinant CspA is expressed and purified using the protocol essentially as
described by Chatterjee et al., J. Biochem. 114:663-669 (1993), and Feng et al.,
Biochemistry 37:10881-10896 (1998), both of which are incorporated by reference.

The sample is analyzed using a Varian Unity 500 spectrometer equipped with three
channels and a fourth frequency synthesizer for carbonyl decoupling as described by
Feng et al., Biochemistry 37:10881-10896 (1998). Figure 7 provides the 2D 15N-1H
HSQC spectrum of CspA at pH 6.0 and 30°C.

The collected spin resonances are analyzed using AUTOASSIGN. The input for
AUTOASSIGN includes peaks from 2D 15N-1H HSQC and 3D HNCO spectra along
with three intraresidue (CANH, CBCANH and HCANH) and three
interresidue (CA(CO)NH, CBCA(CO)NH and HCA(CO)NH) experiments which
provide the Cα, Cβ and Hα resonances of residues i and i-1, respectively. The
[...] summarized in Table 1.
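
The reason intraresidue and interresidue experiments are paired is that the shifts a spin system reports for residue i-1 should reappear as the own shifts of the spin system that precedes it in the sequence. The following is a simplified sketch of that matching idea only; it is not the AUTOASSIGN algorithm, and the spin-system names, chemical shifts and tolerance are hypothetical.

```python
def find_predecessors(spin_systems, tol_ppm=0.3, atoms=("CA", "CB")):
    """For each spin system, list the spin systems that could precede it.

    spin_systems maps a label to {"own": {...}, "prev": {...}}, where "own"
    holds intraresidue shifts (residue i) and "prev" holds the shifts seen
    for residue i-1 in the interresidue experiments, all in ppm.
    """
    links = {}
    for label, system in spin_systems.items():
        matches = []
        for other, candidate in spin_systems.items():
            if other == label:
                continue
            if all(abs(system["prev"][a] - candidate["own"][a]) <= tol_ppm
                   for a in atoms):
                matches.append(other)
        links[label] = matches
    return links


# Hypothetical shifts for three spin systems that form the chain A -> B -> C.
example = {
    "A": {"own": {"CA": 58.1, "CB": 30.2}, "prev": {"CA": 55.0, "CB": 40.0}},
    "B": {"own": {"CA": 62.5, "CB": 69.8}, "prev": {"CA": 58.0, "CB": 30.3}},
    "C": {"own": {"CA": 52.4, "CB": 19.1}, "prev": {"CA": 62.6, "CB": 69.7}},
}
print(find_predecessors(example))
```

In real data several candidates often match within tolerance, which is why additional nuclei (such as Hα and carbonyl shifts) are used to resolve ambiguities.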

Side chain resonance assignments are obtained using PFG HCCNH-TOCSY and
PFG HCC(CO)NH-TOCSY and homonuclear TOCSY experiments recorded with
multiple mixing times of 22, 36, 45, 54, 71 and 90 ms according to the method of Celda
and Montelione, J. Magn. Reson. B 101:189-193 (1993), incorporated by reference in its
entirety. Interatomic distance constraints are derived from three NOESY data sets: 2D
NOESY and 3D 15N-edited NOESY-HSQC spectra recorded with a mixing time of [...]
ms, and [...] recorded with a mixing time of 50 ms on a sample dissolved in 100% D2O.
The intensity of the NOESY-HSQC spectrum is corrected for 15N relaxation effects, and the
cross-peak intensities are converted into interproton distance constraints.
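
A common way to convert NOESY cross-peak intensities into distances is the isolated spin-pair approximation, in which intensity falls off as the inverse sixth power of the interproton distance. The sketch below illustrates that calibration under this assumption; it does not include the 15N relaxation correction mentioned above, and the reference distance and intensities are hypothetical.

```python
def noe_distance(intensity, ref_intensity, ref_distance):
    """Interproton distance from a cross-peak intensity.

    Uses I proportional to r**-6, calibrated against a reference peak of
    known distance: r = r_ref * (I_ref / I) ** (1/6).
    """
    if intensity <= 0 or ref_intensity <= 0 or ref_distance <= 0:
        raise ValueError("intensities and reference distance must be positive")
    return ref_distance * (ref_intensity / intensity) ** (1.0 / 6.0)


# Hypothetical calibration: a reference pair at 1.8 angstroms with intensity 64.
print(noe_distance(1.0, 64.0, 1.8))
```

Because spin diffusion and differential relaxation distort intensities, such distances are normally used as upper bounds rather than exact values.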




and vicinal coupling constant data using the HYPER program. HYPER is
described by Tejero et al., [...] (incorporated by reference in its entirety).
Secondary structural elements of CspA are summarized in Figure 8.
Five β-strands corresponding to polypeptide segments of residues 5-13,
10 22, 30-33, 50-56 and 63-70 are identified.
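
Vicinal coupling constants carry dihedral-angle information through a Karplus-type relation. The sketch below shows that relation for the HN-Hα coupling and the backbone angle φ. It is not the HYPER program, and the coefficients (6.51, -1.76, 1.60 Hz) are an assumed, commonly cited parameterization rather than values taken from this disclosure.

```python
import math


def karplus_hn_ha(phi_deg, a=6.51, b=-1.76, c=1.60):
    """Predicted 3J(HN-HA) coupling in Hz for a backbone phi angle in degrees.

    Karplus form: J = a*cos^2(theta) + b*cos(theta) + c, theta = phi - 60.
    The default coefficients are an assumption (one published parameter set).
    """
    theta = math.radians(phi_deg - 60.0)
    return a * math.cos(theta) ** 2 + b * math.cos(theta) + c


# Extended (beta-strand-like) versus helical phi angles.
print(karplus_hn_ha(-120.0))  # large coupling, typical of extended backbone
print(karplus_hn_ha(-60.0))   # small coupling, typical of helical backbone
```

Large couplings therefore point to extended backbone conformations, consistent with the β-strand segments identified for CspA.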

The average number of distance constraints per residue is [...]. Dihedral angle
constraints are obtained from the HYPER program. Structure calculations
are carried out with DIANA, version 2.8 (TRIPOS, Inc.) using a
Silicon Graphics Onyx workstation (Braun and Go, J. Mol. Biol. 186:611-626 (1985)
and Wüthrich et al., J. Mol. Biol. 169:949-961 (1983), both of which are incorporated by
reference). [...]
defined. Using the refined CspA coordinates defined by the present invention, [...]

homologues. Identified structural homologues of CspA exhibiting similar
[...] include the RNA binding domain of E. coli polyribonucleotide
nucleotidyltransferase, the human mitochondrial ssDNA-binding protein, E. coli
translation initiation factor 1, the ssDNA-binding protein from gene V of filamentous
bacteriophages M13 and f1, the ssDNA-binding protein [...],
elongation factor G from Thermus thermophilus, a domain of E. coli lysyl tRNA
synthetase, a domain of yeast tRNA synthetase, human replication protein A,
Staphylococcus nuclease, and a domain of E. coli topoisomerase I. Although the
function of CspA was already known, the present Example has illustrated the use of the

As the present invention describes, a person of skill in the art is able to take a
polypeptide of unknown function, express and purify a stable polypeptide domain encoded
by the polypeptide, determine the NMR 3D structure of that expressed domain, and
elucidate the function of that domain by comparing the structure of that domain against

structures having known functions. This represents a fundamental paradigm
shift in the study of proteins.
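
Comparing a newly determined structure against structures of known function requires a numerical similarity measure. One simple, superposition-free choice is the distance-matrix RMSD (dRMSD) over equivalent atoms. The sketch below is illustrative only: it assumes the two structures have already been aligned residue by residue, and it is not the comparison method used in this Example.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance-matrix RMSD between two equal-length lists of (x, y, z) points.

    Compares every intramolecular pair distance, so the result does not
    depend on how either structure is rotated or translated.
    """
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two coordinate lists of equal length >= 2")
    total = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(coords_a)), 2):
        diff = math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])
        total += diff * diff
        n_pairs += 1
    return math.sqrt(total / n_pairs)


# Hypothetical three-point "structures": a translated copy and a stretched copy.
query = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in query]
stretched = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in query]
print(drmsd(query, shifted), drmsd(query, stretched))
```

Ranking a library of known structures by such a score gives candidate homologues whose annotated functions can then be examined.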

It will be apparent to those skilled in the art that various modifications and variations can be
made in the present invention without departing from the spirit and scope of the present
invention. It will be additionally apparent to those skilled in the art that the basic
construction of the present invention is intended to cover any variations, uses or
adaptations of the invention following, in general, the principle of the invention and
including such departures from the present disclosure as come within known or
customary practice within the art to which the invention pertains. Therefore, it will be
appreciated that the scope of this invention is to be defined by the claims appended
hereto, rather than by the specific embodiments which have been presented as examples.



